Significance and model comparison

Uncertainty and model comparison

Assume we have the correct model (this is a big assumption!)

  • How should we talk about things like uncertainty and error?

  • How do we make an argument for one model over another?

Uncertainty and model comparison

  • P-values

  • Confidence Intervals

  • S-Values and model comparisons

  • Bayesian alternatives

P-values

What does a p-value mean?

Can this British lady do wild stuff with her tea?

Muriel Bristol

Fisher’s proposed experiment:

  • 8 cups of tea. 4 milk first, and 4 tea first

  • Each cup presented in random order, and she’s asked to classify each one

  • The “null hypothesis”: she’s just guessing

What does a p-value mean?

The p-value is “the probability of correctly classifying at least N cups by guessing”

  • At least 1 correct \(= 69/70 = 99\%\)

  • At least 2 correct \(= 53/70 = 76\%\)

  • At least 3 correct \(= 17/70 = 24\%\)

  • 4 correct \(= 1/70 = 1.43\%\)
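These probabilities come straight from the hypergeometric distribution. A quick sketch in R (variable names are ours):

```r
# The lady picks 4 of the 8 cups as "milk first"; exactly 4 truly are.
# n_correct = number of milk-first cups she identifies correctly.
n_correct <- 0:4
p_exact <- dhyper(n_correct, m = 4, n = 4, k = 4)  # P(exactly k correct)
p_at_least <- rev(cumsum(rev(p_exact)))            # P(at least k correct)
round(p_at_least * 70)  # numerators over 70: 70 69 53 17 1
```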

What does a p-value mean?

  • In the broadest sense, a p-value is the probability of getting a result greater than, less than, or at least as extreme in absolute value as the one in your sample, assuming the null hypothesis is true.

  • This is a statement about the probability of the data given a model, not a statement about the probability of a model.

    • In frequentist statistics, parameters are fixed and hypotheses are either true or false, so it doesn’t make sense to talk about the “probability of a model”.

One tail or two?

  • Pr(>|t|) = .225: there is a 22.5% chance of seeing a result as extreme, in absolute value, as our sample statistic if the null hypothesis is true

  • Pr(>t) = .04: there is a 4% chance of seeing a result greater than the sample statistic if the null hypothesis is true

  • Pr(<t) = .95: there is a 95% chance of seeing a result less than the sample statistic if the null hypothesis is true
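These quantities are related. A sketch with an illustrative t-statistic (t = 1.8 on 49 degrees of freedom; the values are chosen for demonstration, not taken from a model):

```r
t_stat <- 1.8
dof <- 49
p_upper <- pt(t_stat, dof, lower.tail = FALSE)          # Pr(>t), one-tailed
p_lower <- pt(t_stat, dof)                              # Pr(<t), one-tailed
p_two   <- 2 * pt(abs(t_stat), dof, lower.tail = FALSE) # Pr(>|t|), two-tailed
# the one-tailed probabilities sum to 1, and the two-tailed
# p-value doubles the smaller tail
```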

P-values and Type I error

A .05 level means that, when the null hypothesis is true, around 1 in 20 results would be false positives
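A quick simulation illustrates the point: testing a true null over and over, roughly 5% of tests cross the .05 threshold (the simulation settings here are ours):

```r
set.seed(123)
# 5,000 t-tests where the null is actually true (the mean really is 0)
pvals <- replicate(5000, t.test(rnorm(30))$p.value)
mean(pvals < 0.05)  # roughly 0.05
```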

What does a p-value not mean?

  • \(P(t | h_0) \neq P(h_1 | t)\): p-values are not the probability of the alternative hypothesis
  • Significance does not mean the null is false
  • Statistical significance doesn’t mean practical significance.

Regression p-values

In regression output, the p-value typically indicates the probability of seeing a t-statistic as extreme as the one in your sample if the null hypothesis is true:

voteshares$income_over_65k <- voteshares$median_income > 65000

gof_stats <- c("nobs")
coef_names <- c("income_over_65kTRUE" = "Median income over $65K")

model_1 <- lm(demshare ~ income_over_65k, data = voteshares)
modelsummary(list("DV: 2020 statewide Biden %" = model_1),
             statistic = "p.value",
             gof_map = gof_stats,
             coef_rename = coef_names,
             notes = "Note: p.values in parentheses")
                          DV: 2020 statewide Biden %
(Intercept)               0.445
                          (<0.001)
Median income over $65K   0.126
                          (<0.001)
Num.Obs.                  51
Note: p.values in parentheses

Importantly, this assumes that the sampling distribution of coefficients is approximately normal.

Regression p-values: randomization inference

There are some methods for relaxing this assumption, however. One method, mostly used for experiments, is randomization inference, where a binary treatment is randomly “shuffled”:

  • Ideally, we would calculate all possible permutations of the treatment, but this is not feasible for samples with more than a handful of observations.
  • After shuffling, we re-run the regression and calculate an effect size. Then we see how often estimated effects after shuffling are more extreme than the ones we observe.
model_1 <- lm(demshare ~ income_over_65k, data = voteshares)
obs_effect <- model_1$coefficients[2]

set.seed(888)
perms <- 50000
effects <- numeric(length = perms)
for (i in 1:perms) {
  # shuffle the treatment, breaking any real link to the outcome
  permuted <- sample(voteshares$income_over_65k)
  mod <- lm(demshare ~ permuted, data = voteshares)
  effects[i] <- coef(mod)[2]
}

# percent of permuted effects at least as extreme as the observed one
mean((effects >= obs_effect) | (effects <= -obs_effect)) * 100
[1] 0.008

Regression p-values: randomization inference

  • The resulting p-value is said to be “non-parametric” because it doesn’t require an assumption of normality.
  • This means that RI p-values are robust to problems like heteroskedasticity.

Confidence intervals: bootstrapping

Another non-parametric method is the bootstrap, which relies on re-sampling observations with replacement and then re-running the regression model.

  • Although the more common and intuitive application is to use the bootstrap to calculate a confidence interval…
library(dplyr)  # slice_sample(), bind_rows()
library(broom)  # tidy()

model_1 <- lm(demshare ~ income_over_65k, data = voteshares)

iterations <- 10000
results <- vector('list', length = iterations)
model_data <- model_1$model

for (i in 1:iterations) {
  # resample rows with replacement, then re-fit the model
  booted <- model_data |> slice_sample(n = nrow(model_data), replace = TRUE)
  mod <- lm(demshare ~ income_over_65k, data = booted)
  results[[i]] <- tidy(mod)
}
result_frame<-bind_rows(results, .id='iteration')
result_frame|>
  group_by(term)|>
  summarize(
    lb = quantile(estimate, .025),
    est = median(estimate),
    ub = quantile(estimate, .975),
    p = min(mean(estimate>0), mean(estimate<0)) * 2
    )
term                 lb      est    ub     p
(Intercept)          0.417   0.445  0.474  0
income_over_65kTRUE  0.0628  0.126  0.192  0


P-values: caveats

  • P1. p-values can indicate how incompatible the data are with a specified statistical model.
  • P2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • P3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  • P4. Proper inference requires full reporting and transparency.
  • P5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • P6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

From: McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885-895.

Problems with P-values

  • Often misinterpreted, even by experts

  • They’re continuous, but they tend to be treated as binary and conflated with importance

  • In large samples they become infinitesimally small, which makes them hard to read and compare

Confidence intervals

What do confidence intervals mean?

In repeated sampling, 95% of the 95% confidence intervals will contain the correct value of \(\beta\)
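This coverage claim can be checked by simulation (the data-generating process below is made up for illustration):

```r
set.seed(42)
beta <- 2  # the "true" slope, known because we generate the data ourselves
covered <- replicate(2000, {
  x <- rnorm(50)
  y <- 1 + beta * x + rnorm(50)
  ci <- confint(lm(y ~ x), level = 0.95)["x", ]
  ci[1] <= beta && beta <= ci[2]  # did this interval capture beta?
})
mean(covered)  # should be close to 0.95
```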

Confidence Intervals

  • It’s misleading to say there’s a 95% chance that the population \(\beta\) falls between the CI boundaries. (Frequentist statistics assumes that \(\beta\) is fixed, so it doesn’t have a “chance” of doing anything.)

  • Better to say “I’m 95% certain this range contains the true value”

Confidence Intervals

  • Pro: p-values and confidence intervals are based on the same basic assumptions, but CIs can be more informative about uncertainty and don’t lend themselves to the same dichotomous interpretation that a p-value does.

  • Pro: Automatically scaled to the quantity of interest, so harder to confuse with a measure of substantive importance

  • Pro: can be used for equivalence testing (we’ll come back to this)

  • Con: also commonly misinterpreted!

Alternatives: S-Values

One proposed alternative: S-values, AKA surprisal values (based on Shannon information)

\[\text{S-value} = -\log_2(p)\]

  • Intuitive interpretation: an S-value of \(s\) conveys the same surprise as flipping a fair coin \(s\) times and having every flip land on heads. So S = 4 is as surprising as four heads in a row.

S-values increase as evidence against the null gets stronger, and their magnitudes are less extreme than tiny p-values:

P value      S value
0.9          0.15
0.06         4.06
0.05         4.32
0.01         6.64
0.001        10
0.000000001  30
1e-100       332
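Computing an S-value is a one-liner. For instance, p = 0.5^4 = 0.0625 is exactly as surprising as four heads in a row:

```r
# S-value: the surprisal (base-2 log) transform of a p-value
s_value <- function(p) -log2(p)
s_value(0.5^4)  # 4: same surprise as 4 fair-coin heads in a row
s_value(0.05)   # about 4.32
```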

Alternatives: S-Values

  • Pro: Still based on the same assumptions as a p-value; it’s just a transformation of the p-value that might be more intuitive for people.

  • Pro: easy calculation (for a computer, at least)

  • Con: Not widely adopted. All of this stuff relies on convention.

Bayesian methods

  • Bayesian methods potentially allow us to do the thing we wish we could do with p-values: quantify hypothesis probabilities.

  • One downside is that it sort of requires everyone to adopt a totally different philosophical approach to probabilities: Bayesian statistics assume that the model parameters also have a distribution of their own, and we use data to update our knowledge about the nature of those distributions.

  • Another downside is that Bayesian inference requires us to come up with “prior” distributions for model parameters, which invariably involves some subjectivity.

Bayes Factors Approximation

Bayes Factors measure the weight of evidence in favor of one hypothesis relative to another.

Wagenmakers (2007) suggests a frequentist approximation of a Bayes Factor using the Bayesian Information Criterion.

  • Run a model with the variable of interest (model 1)

  • Run a model without the variable of interest (model 0)

  • Calculate the BIC for both

  • Calculate \(\frac{\exp(-\text{BIC}_1/2)}{\exp(-\text{BIC}_0/2)}\)

Bayes Factors Approximation

ff<-read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/gov_transfers.csv")

model_0 <- lm(Support ~ Income_Centered + Education + Age, data = ff)

model_1 <- lm(Support ~ Participation + Income_Centered + Education + Age, data = ff)

exp(-BIC(model_1)/2) / exp(-BIC(model_0)/2)
[1] 7.509412

Numbers greater than 1 indicate evidence in favor of model 1, numbers less than 1 indicate evidence against model 1.

Comparing Models

  • If I have multiple models, how do I tell whether one is better than the other?

“Good” models can be bad predictors

R-squared

R-squared can be used for model comparison, but it doesn’t tell you a model is a good predictor.

Comparing models

  • Make a fair comparison: our “restricted” model \(Y_i = \bar{Y} + e_i\)
  • Our “unrestricted” comparison model: \(Y_i = \alpha + \beta_1X + e_i\)
  • How much can we improve relative to the baseline? Stated differently: our null hypothesis is that the restricted model is just as good as our new one.

Comparing models

Predicting Democratic voteshare using the 2016 share

state       demshare  demshare_2016  GEOID  pop       median_age  white_pop  bachelors_degree  median_income  income_over_65k
Alabama     0.371     0.356          01     4.89e+06  39.2        3.3e+06    5.47e+05          5.2e+04        FALSE
Alaska      0.447     0.416          02     7.37e+05  34.6        4.67e+05   9.02e+04          7.78e+04       TRUE
Arizona     0.502     0.481          04     7.17e+06  37.9        5.29e+06   9.11e+05          6.15e+04       FALSE
Arkansas    0.358     0.357          05     3.01e+06  38.3        2.27e+06   3.09e+05          4.95e+04       FALSE
California  0.649     0.661          06     3.93e+07  36.7        2.21e+07   5.76e+06          7.87e+04       TRUE
Colorado    0.569     0.527          08     5.68e+06  36.9        4.63e+06   1.02e+06          7.52e+04       TRUE

Comparing models

\[ F = \frac{(RSS_{restricted} - RSS_{full})/(df_{restricted} - df_{full})}{RSS_{full}/df_{full}} \]

model_0 <- lm(demshare~ 1, data=voteshares)

model_1 <-lm(demshare ~ 1 + demshare_2016, data=voteshares)

anova(model_0, model_1)
Res.Df  RSS      Df  Sum of Sq  F        Pr(>F)
50      0.758
49      0.00896  1   0.749      4.1e+03  6.81e-49

Comparing models

We can continue to include additional covariates in the model. But the “fair comparison” here is probably the same model minus one covariate (rather than comparing to the null model)

Res.Df  RSS      Df  Sum of Sq  F         Pr(>F)
50      0.758
49      0.00896  1   0.749      4.08e+03  4.25e-48
48      0.00881  1   0.00015    0.816     0.371

BIC and AIC

  • BIC stands for Bayesian Information Criterion.
  • AIC stands for Akaike Information Criterion.
  • Both values are based on the model likelihood, which, for linear models, is a function of the sum of squared residuals:

\[\text{BIC} = n\ln(RSS/n) + k\ln(n)\]

\[\text{AIC} = n\ln(RSS/n) + 2k\]

Note that both the AIC and BIC include a penalty for k, which is the number of model parameters. So both will penalize models with garbage predictors.
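The formulas above can be computed by hand (here on simulated data, since the exercise only needs some fitted model). One caveat: R’s built-in `AIC()`/`BIC()` use the full Gaussian log-likelihood and count \(\sigma^2\) as a parameter, so they differ from the hand calculation by a constant that is the same for every model on the same data, which is all that matters for comparison:

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
mod <- lm(y ~ x)

n <- length(y)
k <- length(coef(mod))   # number of regression parameters
rss <- sum(resid(mod)^2)

aic_hand <- n * log(rss / n) + 2 * k
bic_hand <- n * log(rss / n) + k * log(n)

# differences from stats::AIC()/BIC() are model-independent constants:
AIC(mod) - aic_hand  # n * (log(2 * pi) + 1) + 2
BIC(mod) - bic_hand  # n * (log(2 * pi) + 1) + log(n)
```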

BIC and AIC

Lower values of both AIC and BIC indicate better model fits:

BIC(model_1)
[1] -284.4724
BIC(model_2)
[1] -281.4003
AIC(model_1)
[1] -290.2679
AIC(model_2)
[1] -289.1276

Out of sample prediction

  • Basic method: set aside some observations, fit the model on the training set, and then predict the test set

  • For small samples, k-fold CV or LOOCV is preferable: leave out one observation (or fold), fit, predict the held-out data, and repeat; then average the metric.

library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
cv_1 <- train(demshare ~ demshare_2016, data = voteshares, 
               method = "lm",
               trControl = train.control)

Out of sample prediction

Linear Regression 

51 samples
 1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 
Resampling results:

  RMSE        Rsquared   MAE       
  0.01380351  0.9871893  0.01096933

Tuning parameter 'intercept' was held constant at a value of TRUE

Out of sample prediction

…but is this the right metric?

Arguing for a null

Conventional hypothesis testing “stacks the deck” in favor of finding no relationship, and failure to reject the null could simply be a function of small sample sizes.

Still, we often want to dismiss a claim:

  • Can citizen pressure campaigns make states more effective?

  • Do motor voter laws improve turnout?

  • Are some people actually psychic?

TOST and equivalence testing

“Two one-sided tests”

Step 1: Set a minimum effect size (\(\Delta\)) worth caring about

Step 2: Conduct a one-sided t-test for each of:

  • \(H_{01}\): \(\theta \leq -\Delta\)

  • \(H_{02}\): \(\theta \geq \Delta\)

where \(\theta\) is the true effect. If both of these tests are rejected, then we can conclude the observed effect is smaller in magnitude than the minimum effect size.

TOST and equivalence testing

  • Eskine (2013) found that people exposed to organic foods became morally judgmental.

  • Moery and Calin-Jageman (2016) attempt to replicate this finding

https://www.nbcnews.com/news/world/does-organic-food-turn-people-jerks-flna779676

TOST and equivalence testing

Moery and Calin-Jageman’s replication results look like this:

Group      N    Mean   SD
Control    95   5.25   0.95
Treatment  89   5.22   0.83

TOST and equivalence testing

The estimated effect: 5.25 - 5.22 = 0.03

Minimal effect size \(\Delta = 0.43\)

What is the probability of seeing a mean difference greater than or equal to 0.03 if the true difference is 0.43?

What is the probability of seeing a mean difference less than or equal to 0.03 if the true difference is -0.43?
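These two tail probabilities can be approximated directly from the reported summary statistics. This is a sketch, not the authors’ exact procedure: the standard error assumes unequal variances, and the degrees of freedom below are a conservative simplification:

```r
m_c <- 5.25; s_c <- 0.95; n_c <- 95  # control group
m_t <- 5.22; s_t <- 0.83; n_t <- 89  # treatment group
delta <- 0.43                        # smallest effect size of interest

est <- m_c - m_t                     # observed difference: 0.03
se  <- sqrt(s_c^2 / n_c + s_t^2 / n_t)
dof <- min(n_c, n_t) - 1             # conservative degrees of freedom

# H01: true difference <= -delta; H02: true difference >= +delta
p_1 <- pt((est + delta) / se, dof, lower.tail = FALSE)
p_2 <- pt((est - delta) / se, dof, lower.tail = TRUE)
c(p_1 = p_1, p_2 = p_2)  # both well below .05: reject both, conclude equivalence
```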

TOST and equivalence testing

There’s also an equivalent method that relies on 90% confidence intervals instead of t-tests:

  • Calculate a 90% confidence interval for the effect
  • Check if \(\hat{\theta}_{lb} \geq -\Delta \text{ and } \hat{\theta}_{ub} \leq \Delta\) (i.e., is the entire 90% CI inside the equivalence range?)
  • If so, then we can conclude that the observed effect falls within our range of equivalence.
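Applied to the replication’s summary statistics (difference = 0.03, \(\Delta\) = 0.43; the standard error calculation is a simplification using the reported SDs):

```r
est   <- 5.25 - 5.22                      # observed mean difference
se    <- sqrt(0.95^2 / 95 + 0.83^2 / 89)  # approximate standard error
delta <- 0.43                             # equivalence bound

ci_90 <- est + c(-1, 1) * qt(0.95, df = 88) * se
ci_90                                # roughly (-0.19, 0.25)
all(ci_90 > -delta & ci_90 < delta)  # TRUE: the entire 90% CI is inside the range
```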

Null effects and Duverger’s law

Rainey:

  • “Hypothesis 1: Increasing the effective number of ethnic groups from the 10th percentile (1.06) to the 90th percentile (2.48) will not lead to a substantively meaningful change in the effective number of political parties when the district magnitude is one.”

  • “Hypothesis 2: Increasing the district magnitude from one to seven will not lead to a substantively meaningful change in the effective number of political parties when the effective number of ethnic groups is one.”

  • Substantively meaningful effect: changing the effective number of parties by 0.62 (this is the difference in the effective number of political parties between the U.S. and the U.K.)

Null effects and Duverger’s law

Marginal effects and quantities of interest

  • For simple regression models, we can get most of the interesting stuff from just looking at the output, but what if we’re working with something more complex?

For instance, this model includes a polynomial term for percent Bachelor’s degrees in a given state. What is the effect of a one unit increase in Bachelor’s degrees, and is this statistically significant?


Call:
lm(formula = demshare ~ poly(pct_bachelors, 2, raw = T), data = voteshares)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.188633 -0.071404  0.007378  0.046200  0.214193 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)                       0.595663   0.391311   1.522   0.1345  
poly(pct_bachelors, 2, raw = T)1 -0.057409   0.058086  -0.988   0.3279  
poly(pct_bachelors, 2, raw = T)2  0.003591   0.002125   1.690   0.0975 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08616 on 48 degrees of freedom
Multiple R-squared:  0.5299,    Adjusted R-squared:  0.5103 
F-statistic: 27.05 on 2 and 48 DF,  p-value: 1.356e-08

Marginal effects and quantities of interest

There’s not a straightforward way to calculate an effect or a confidence interval here analytically, so we need to use other methods.

One option is bootstrapping, another is simulation.

      2.5%        50%      97.5% 
0.03353758 0.04420986 0.05611655 

The marginaleffects package will do this automatically:


 Estimate Std. Error    z Pr(>|z|)    S  2.5 % 97.5 %
   0.0441    0.00606 7.28   <0.001 41.4 0.0322  0.056

Term: pct_bachelors
Type: response
Comparison: +1

Marginal effects and quantities of interest